Abstract

We study from a physics viewpoint a class of generative neural nets, Gibbs machines, designed for gradual learning. While including variational auto-encoders, they offer a broader universal platform for incrementally adding newly learned features, including physical symmetries. Their direct connection to statistical physics and information geometry is established. A variational Pythagorean theorem justifies invoking the exponential/Gibbs class of probabilities for creating brand new objects. Combining these nets with classifiers gives rise to a brand of universal generative neural nets - stochastic auto-classifier-encoders (ACE). ACE have state-of-the-art performance in their class, both for classification and density estimation for the MNIST data set.

1 Introduction.

1.1 Universality. 

We buck the recent trend of building highly specialized neural nets by exploring nets which accomplish multiple tasks without compromising performance. A universal net can be tentatively described as one which, among other things: i) works for a variety of applications, e.g. visual recognition/reconstruction, speech recognition/reconstruction, natural language processing, etc.; ii) performs various tasks: classification, generation, probability density estimation, etc.; iii) is self-contained, i.e. does not use specialized external machine learning methods; iv) is biologically plausible.

1.2 Probabilistic and quantum viewpoint on generative nets.

The input of a neural net is typically a $P \times N$ data matrix $X$. Its row-vectors $\{x_\mu\}_{\mu=1}^{P}$ span the space of observations; its column-vectors $\{x_i\}_{i=1}^{N}$, where the index $i = 1, \ldots, N$ can for example enumerate the pixels on a screen, span the space of observables. The net is then asked to perform classification, estimation, generation, etc., tasks on it. In generative nets, this is accomplished by randomly generating $L$ latent observations for every observation $x_\mu$. This induced “uncertainty” of the $\mu$-th state is modeled by a model conditional density $p(z|x_\mu)$. It is the copy-cat, in imaginary time/space, of the (squared) wave function from quantum mechanics^1 and fully describes the $\mu$-th conditional state. In statistical mechanics parlance, the latents are fluctuating microscopic variables, while the macroscopic observables are obtained from them via some aggregation. In the absence of physical time, observations are thus interpreted as partial equilibria of independent small parts of the expanded (by a factor of $L$) original data set. Simply put, every visible observation is surrounded by a “cloud” of virtual observations. Creating a new original observation amounts to nothing more than sampling from that cloud.

^1 Strictly speaking, we will employ unbounded densities and hence stochastic analysis formalism and its centerpiece - the diffusion equation. But they are formally equivalent to the quantum-mechanical formalism and its centerpiece - the Schrödinger equation - in imaginary time/space coordinates.


The quality of the model conditional density, or more generally - the model joint density $p(z, x_\mu)$ - is judged by the “distance” from the implied marginal density $q(x) := \int p(x, z)\,dz$ to the empirical marginal density $r(x)$. This distance is called cross-entropy or negative log-likelihood:

$-\log \mathcal{L}(r||q) := E_{r(x)}[-\log q(x)],$ (1.1)

where $E_{r()}[.]$ is an expectation with respect to $r()$. Its minimization is the ultimate goal.


1.3 Equilibrium setting. Gibbs machines. 


The equilibrium, i.e. small fluctuations, viewpoint of statistical mechanics appears to have been originated by Einstein in a then-unpublished 1910 lecture, Einstein [2006]. He used an exponential model density in space and derived the Brownian diffusion, i.e. a Gaussian model density $p^G(\cdot)$, in space/time. They are special cases of a broad class of densities - Gibbs or exponential densities - which form the foundation of classic statistical mechanics. Gibbs densities are variational maximum-entropy densities and hence optimal for modeling equilibria. They also offer a platform for incrementally adding new macroscopic descriptive variables, Landau and Lifshitz [1980], section 35.

We argue in sub-sections 2.6, 3.7 that Gibbs densities are also optimal for modeling fully-generative equilibrium nets, and call those nets Gibbs machines. They were inspired by the first fully-generative nets - the variational auto-encoders (VAE), Kingma and Welling [2014], Rezende et al. [2014] - and employ the same upper bound for the cross-entropy target. Like their physics counterparts, Gibbs machines offer a platform for mimicking the gradual nature of learning: already learned symmetry statistics, like space/time symmetries, can be added incrementally and accelerate learning, sections 1.6, 2.3, 4.


1.4 Observation entropies 

Unlike equilibrium statistical mechanics, human data is decidedly non-equilibrium in nature and exhibits large fluctuations and non-Gaussian behavior. Quantifying non-Gaussianity and “distance” from equilibrium is not easy when dealing with a large number of observables $N$ and observations $P$. Luckily, there is a one-dimensional proxy for non-Gaussianity of a multi-dimensional data set: the non-Gaussianity of the negative Gaussian log-likelihoods $\{-\log p^G(z_\mu)\}_\mu$^2. Here, $-\log p^G(z_\mu) = \frac{1}{2} z_\mu C^{-1}(z) z_\mu^T + const$, with a multivariate Gaussian with covariance $C(z)$ as model density and $C(z)$ the empirical covariance.

In a bold and counter-intuitive re-read of Boltzmann’s statistical mechanics, Einstein interpreted the log-likelihoods $\{\log p^G(z_\mu)\}_\mu$ as observation entropies, Einstein [2006]. We will denote the Einstein entropy^3 of an observation $z_\mu$ by:

$S^E(z_\mu) := \log p^G(z_\mu) + const > 0.$ (1.2)

The sub-section 1.2 viewpoint of an observation, as a partial equilibrium of multiple virtual observations, fits right into Einstein’s paradigm: entropy is defined in a classical Boltzmann fashion, but on a cloud of virtual observations. The visible observation stands out as the one with a locally maximum entropy.

Observation entropies are central to the modern theory of fluctuations, Landau and Lifshitz [1980], chapter 12. They also have an elegant linear-algebraic incarnation in the singular value decomposition of the data matrix $X$ (sub-section 5.1). In addition, their second moment $E_r[(\log p^G(z_\mu))^2]$ is proportional to the multi-variate kurtosis, measuring the “fatness” of the probability density of $\{z_\mu\}$.

1.5 The equilibrium curse. Intricates.

Unfortunately, some of the key features of modern neural nets, like non-linear activation functions and dropout, Srivastava et al. [2014], come at the high price of Gaussianizing the data set, i.e. lead to higher-entropy, less informative configurations. The right quantile-quantile (Q-Q) plot in Figure 1 shows the Gaussianization effect of non-linearities and dropout for the MNIST data set, LeCun et al. [1998]. Bounded non-linearities Gaussianize because they are compressive in nature and “straighten out” the unlikely (with small entropies) observations, which we refer to as intricates. Dropout Gaussianizes because it drops latent variables and thus decreases the kurtosis.

^2 If the latent observations $\{z_\mu\}$ come from an $N_{lat}$-dimensional Gaussian distribution, the density of these negative Gaussian log-likelihoods is proportional to the familiar $F(N_{lat}, P - N_{lat})$ density, Mardia et al. [1979], sections 1.8, 3.5. For the typical case $P - N_{lat} \to \infty$, it is proportional to the chi-squared density $\chi^2_{N_{lat}}()$, which in turn converges to a rescaled Gaussian $\mathcal{N}(0, 1)$, as $N_{lat} \to \infty$.

^3 In order for the negative logs to be thought of as entropies, a large positive constant is added. It should be clear from the context, but we nevertheless use for clarity a superscript, to distinguish the Einstein entropy $S^E(.)$ of an observation from the standard Boltzmann entropy $S(.)$ of a probability density, introduced in sub-section 2.1.

Figure 1: Q-Q plots against a Gaussian of the density of the negative log-likelihoods $\{-\log p^G(z_\mu)\}_\mu$ of the 10000 MNIST test observations in layer 2 of a 5-layer standard feed-forward classifier net, see right branch of Figure 5 and Appendix A for implementation details. Layer sizes are 784-700-700-700-10. Learning rate = 0.0015, decay = 500 epochs, batch size = 10000. For the right plot, dropout is 0.2 in the input layer and 0.5 in hidden layers. As an exception from the rules in Appendix A, a tanh() activation function is used in the first hidden layer. Left. No dropout and no non-linearity: highly non-Gaussian. Right. With dropout and non-linearity: severely Gaussianized, especially for the intricates towards the right (see text).
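
As a minimal sketch of the one-dimensional proxy from sub-section 1.4, the following Python code computes the negative Gaussian log-likelihoods (up to the additive constant) of a set of row-vector observations, and the excess kurtosis of a list of numbers. It assumes that the data are centered by their empirical mean and that $C$ is the unbiased empirical covariance; the function names are hypothetical and are not taken from the paper's implementation.

```python
def mean(values):
    return sum(values) / len(values)


def covariance(rows):
    """Unbiased empirical covariance of row-vector observations.

    Returns the covariance matrix and the centered rows (assumption:
    observations are centered by their empirical mean).
    """
    p, n = len(rows), len(rows[0])
    mu = [mean([r[j] for r in rows]) for j in range(n)]
    centered = [[r[j] - mu[j] for j in range(n)] for r in rows]
    cov = [[sum(c[i] * c[j] for c in centered) / (p - 1) for j in range(n)]
           for i in range(n)]
    return cov, centered


def solve(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [x - f * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def gaussian_nll(rows):
    """-log p^G(z_mu) up to an additive constant: 0.5 * z C^{-1} z^T."""
    cov, centered = covariance(rows)
    return [0.5 * sum(zi * ci for zi, ci in zip(z, solve(cov, z)))
            for z in centered]


def excess_kurtosis(values):
    """Fourth standardized moment minus 3 (zero for a Gaussian)."""
    mu = mean(values)
    c = [x - mu for x in values]
    m2 = mean([x * x for x in c])
    m4 = mean([x ** 4 for x in c])
    return m4 / (m2 * m2) - 3.0
```

Applying `excess_kurtosis` to the output of `gaussian_nll` gives one scalar summary of how far the tails of the log-likelihoods are from those of a Gaussian; the Q-Q plots of Figure 1 show the same information graphically.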

It is precisely the intricates which - because of their low entropy and extreme non-Gaussianity, see Figure 2 - are ideal candidates for “feature vectors” in classification tasks, Hyvärinen et al. [2009], sections 7.9, 7.10. Their conjugates^4 are then the “receptive fields” or “feature detectors” - see open problem ?? in section ??. We show on the top (respectively, bottom) plot in Figure 3 the 30 least (respectively, most) likely images in MNIST, in ascending order of their entropy $S^E$, from the class corresponding to the digit 8.

^4 Recall that, for a given row-vector observation $x_\mu$, its conjugate is $x_\mu C^{-1}$, where $C$ is the covariance matrix. Up to a constant, the Gaussian negative log-likelihood is thus the inner product of an observation and its conjugate, in the standard Euclidean metric: $-\log p^G(x_\mu) = \frac{1}{2} \langle x_\mu, x_\mu C^{-1} \rangle$.



Figure 2: Left. The first 5000 MNIST training images, projected on three of the least likely, i.e. most intricate, conjugate images, ranked #3, #4, #6 in ascending order of Einstein entropy $S^E(x_\mu)$, (1.2). This is a highly non-Gaussian 3-dimensional distribution. Right. The same MNIST images, projected on the three most likely conjugate images, ranked #4998, #4999, #5000 in Einstein entropy. Much more Gaussian-looking.

Figure 3: Top. The 30 lowest-entropy MNIST training images, using the Einstein entropy $S^E(z_\mu)$, (1.2), from the class corresponding to the digit 8. They are quite intricate indeed. Bottom. The 30 highest-entropy MNIST training images from the same class. Much more vanilla-looking.


1.6 Symmetries in the latent manifold. 

When the dimensionality $N_{lat}$ of the ACE latent layer is low, traversing the latent dimension in some uniform fashion describes the latent manifold for a given class. Figure 4 shows the dominant dimension for each of the 10 classes in MNIST. This so-called manifold learning by modern feed-forward nets was pioneered by the contractive auto-encoders (CAE), Rifai et al. [2012]. A symmetry in our context is, loosely speaking, a one-dimensional parametric transformation which leaves the cross-entropy unchanged. In probabilistic terms, this is equivalent to the existence of a one-parametric density, from which “symmetric” observations are sampled, see (1.3) below. Nets currently learn symmetries from the training data, after it is artificially augmented, e.g. by adding rotated, translated, rescaled, etc., images, in the case of visual recognition. But once a symmetry is learned, it does not make sense to re-learn it for every new data set.

Figure 4: Dominant dimension for each of the ten MNIST classes, with each row corresponding to a separate class. While rotational symmetry dominates most classes, size, i.e. scaling symmetry, clearly dominates the class of digit 5. The net is an ACE in creative regime as in sub-section 3.3, with an equally spaced deterministic grid in the latent layer $\{\sigma_s\}_{s=1}^{L}$, $-6 < \sigma_s < 6$. Layer sizes are 784-700-(1x10)-(700x10)-(784x10) for the AE branch and 784-700-700-700-10 for the C branch, Figure 5 and Appendix A, learning rate = 0.0002, decay = 500 epochs, batch size = 1000.

We hence propose the reverse approach: add the symmetry explicitly to the latent layer, alongside its Noether invariant, Gelfand and Fomin [1963]. Take for example translational symmetries in a two-dimensional system with coordinates $(z^{(h)}, z^{(v)})$. They imply the conservation of the horizontal and vertical momenta $(-i\hbar\,\partial/\partial z^{(h)}, -i\hbar\,\partial/\partial z^{(v)}) = (p^{(h)}, p^{(v)})$ and a quantum mechanical wave function $\sim e^{\frac{i}{\hbar}(p^{(h)}(z^{(h)} - h) + p^{(v)}(z^{(v)} - v))}$, where $(h, v)$ are offsets, Landau and Lifshitz [1977], section 15. After switching to imaginary time/space as in sub-section 1.2 and setting $\hbar = 1$, the corresponding conditional model density for a given observation/state $\mu$ is:

$p(z|x_\mu) \sim e^{-p^{(h)}_\mu |z^{(h)} - h_\mu| - p^{(v)}_\mu |z^{(v)} - v_\mu|},$ (1.3)

i.e. a two-dimensional Laplacian, which fits^5 in the Gibbs machine paradigm (2.7). We demonstrate in section 4 how to build in translational, scaling and rotational symmetry in a net, by computing the symmetry statistics explicitly and estimating the invariants with the rest of the net parameters. In general, they have to be refined via an optimization, as e.g. in Jaderberg et al. [2015]. For more details, see Georgiev [2015b].


1.7 ACE.

1.7.1 Non-generative ACE.

In order to preserve the non-Gaussianity of the data, and improve performance significantly along the way, we will combine classifiers with auto-encoders - hence the name auto-classifier-encoder (ACE). Auto-encoders have a reconstruction error in their cross-entropy optimization target and thus force the net to be more faithful to the raw data. ACE simultaneously classifies and reconstructs, assuming an independence between the two, and hence additivity of the respective cross-entropies:

$-\log \mathcal{L}_{ACE} = -\log \mathcal{L}_{AE} - \log \mathcal{L}_{C}.$ (1.4)

In its first - non-generative - installment, ACE can do with a standard classifier and a shallow auto-encoder in the dual space of observations.^6 It still handily beats the peers in its class, Figure ??, right.

1.7.2 Non-Gaussian densities.

In real-life data sets, the number of observations $P \to \infty$, while the dimension of observables $N$ is fixed. A universal net will hence tend to work better when the dimension of latent layers $N_{lat} > N$, i.e. have the so-called overcomplete representation, Coates et al. [2011]. When $N_{lat} \gg N$, for any given $N$-dimensional observation $x_\mu$, only a small number of latents deviate significantly from zero. For these sparse representations, sampling from high-entropy Gaussian-like densities, as on the right plot of Figure 2, is flawed. Sampling instead from “fat-tail” densities offers a significant performance improvement for MNIST, Figure ??, right. As in mathematical finance, stochastic volatility and jumps are arguably the first natural source of non-Gaussianity, and are almost fully tractable. The q-Gibbs machines offer another avenue, sub-section 2.3.

^5 Technically, the Laplacian is not in the exponential class, but it is a sum of two exponential densities in the domains $(-\infty, \mu)$, $[\mu, \infty)$ defined by its mean $\mu$, and those densities are in the exponential class in their respective domains. The Laplacian is biologically plausible because it is a by-product of squaring Gaussians.

^6 In its embryonic form, the shallow ACE seems to have first appeared in Le et al. [2011], with the purpose of replacing orthogonality constraints in Independent Component Analysis.

1.7.3 Generative ACE.

An even greater issue for current nets is the spontaneous “clumping” or clusterization which is prevalent in real-life data sets. Statistical mechanics deals with it by introducing higher-hierarchy densities which are conditional on low-hierarchy densities. The Fermi density discussed in sub-section 3.1 is an example of a higher-hierarchy density, built on top of the Boltzmann density, subject to additional constraints, Landau and Lifshitz [1980], section 53. Clusterization aggregates the low-hierarchy partial equilibria - the observations - into higher-hierarchy partial equilibria - clusters, sub-section 1.2.

To mimic this universal phenomenon, in its second, generative, installment, ACE combines a classifier and a generative auto-encoder in the same space of observables, in a brand of auto-encoder supervision, Figure 5. ACE generalizes the classic idea of using separate decoders for separate classes, Hinton et al. [1995]. In training, the conditional latent density $p(z|x_\mu)$ from sub-section 1.2 is generalized to $p(z|x_\mu, c_\mu)$, where $c_\mu$ is the class or label of the $\mu$-th observation. Since, of course, classification labels can not be used on the testing set, the sampling during testing is from a mixture of densities, with class probabilities supplied by the classifier (hence the dashed line in Figure 5). Mixture densities in the posterior were also used in Kingma et al. [2014], albeit in a different architecture. The ACE is universal in the sense of sub-section 1.1 and achieves state-of-the-art performance, both as a classifier and a density estimator, Figure ??. For its relation with information geometry, see open problem ?? in section ??.

Figure 5: ACE architecture: AE stands for “auto-encoder”, C stands for “classifier”. Training is supervised, i.e. labels are used in the auto-encoder and each class has a separate decoder, with unimodal sampling in the latent layer. The sampling during testing is instead from a mixture of the class-conditional densities $p(z|x_\mu, c)$, weighted by the class probabilities provided by the classifier, hence the dashed lines.
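
The test-time mixture sampling described for the generative ACE can be sketched as follows. This is only an illustration under stated assumptions: each class-conditional latent density is taken to be a diagonal Gaussian (the text only says the per-class sampling is unimodal), and the function name and arguments are hypothetical rather than part of the paper's code.

```python
import random


def sample_mixture_latent(class_probs, class_means, class_stds, rng=random):
    """Draw one latent vector from a mixture of class-conditional densities.

    class_probs: class probabilities supplied by the classifier branch.
    class_means, class_stds: per-class parameters of an assumed diagonal
    Gaussian latent density p(z | x_mu, c).
    Returns the sampled class index and the sampled latent vector.
    """
    # First pick a class according to the classifier's probabilities ...
    c = rng.choices(range(len(class_probs)), weights=class_probs, k=1)[0]
    # ... then sample the latent from that class's unimodal density.
    z = [rng.gauss(m, s) for m, s in zip(class_means[c], class_stds[c])]
    return c, z
```

During training the class index would instead be fixed to the label $c_\mu$ of the observation, so only the second step applies.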

2 Theoretical background. 

2.1 Definitions. 

For discrete data, our minimization target - the cross-entropy (1.1) - can be decomposed as the sum:

$-\log \mathcal{L}(r||q) = S(r) + \mathcal{D}(r||q),$ (2.1)

of the Kullback-Leibler divergence $\mathcal{D}(r||q)$ between empirical $r()$ and model $q()$ densities:

$\mathcal{D}(r||q) := E_{r(x)}[\log r(x) - \log q(x)],$ (2.2)

and the standard Boltzmann entropy $S(r)$ of $r(.)$:

$S(r) := E_{r(x)}[-\log r(x)],$ (2.3)

with $S(.) \ge 0$. Because the entropy $S(r)$ of $r(.)$ does not depend on our model distribution $q(.)$, the minimization of the cross-entropy is equivalent to minimizing the Kullback-Leibler divergence $\mathcal{D}(r||q)$. It is common in statistical physics, see e.g. Naudts [2010], to formally consider $-\mathcal{D}(r||q)$ as a generalized entropy $S_q(r)$, for the case of a non-trivial base measure $q()$ - see (2.20) below.

As discussed in sub-section 1.2, the latents $\{z\}$ in generative nets are sampled from a closed-form conditional model density $p(z|x)$. The latents are of course not given a priori, so the joint empirical density is:

$r(x, z) = \frac{1}{P} \sum_\mu \delta(x - x_\mu)\, p(z|x),$ (2.4)

with corresponding marginal empirical densities $r(x) = \frac{1}{P} \sum_\mu \delta(x - x_\mu)$, $r(z) = \frac{1}{P} \sum_\mu p(z|x_\mu)$ (Kulhavý [1996], section 2.3; Cover and Thomas [2006], problem 3.12). Hence, the cross-entropy of $q(x) := \int p(x, z)\,dz = p(x, z)/p(z|x)$ is an arithmetic average^7 across observations: $-\log \mathcal{L}(r||q) = -\frac{1}{P} \sum_\mu \log q(x_\mu)$. From the Bayes identity, we have for our optimization target (1.1), in terms of the joint density:

$-\log \mathcal{L}(r||q) := E_{r(x)}[-\log q(x)] = E_{r(x,z)}[-\log p(x, z)] + E_{r(x,z)}[\log p(z|x)].$ (2.5)

From the explicit form (2.4) of $r(x, z)$, for the $\mu$-th observation:

$-\log q(x_\mu) = \mathcal{D}(p(z|x_\mu)||p(x_\mu, z)) = E_{p(z|x_\mu)}[-\log p(x_\mu, z)] - S(p(z|x_\mu)),$ (2.6)

where $S(p(z|x_\mu)) = E_{p(z|x_\mu)}[-\log p(z|x_\mu)]$ is the Boltzmann entropy of the model distribution, conditional on a given observation $x_\mu$. If we sample the latent observables only once per observation, as commonly done, the right-hand side reduces to $-\log p(x_\mu, z_\mu) + \log p(z_\mu|x_\mu)$.

^7 This decomposition does not imply independence of observations: the latent variables can in general contain information from more than one observation, as for example in the case of time series auto-regression.
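
The decompositions (2.1) and (2.6) can be checked numerically on a small discrete example. The sketch below assumes finite supports for both $x$ and $z$, and, for (2.6), that $p(z|x_\mu)$ is the exact conditional $p(x_\mu, z)/q(x_\mu)$; the function names are illustrative only.

```python
import math


def entropy(r):
    """Boltzmann entropy S(r) of a discrete density, as in (2.3)."""
    return -sum(p * math.log(p) for p in r if p > 0)


def kl_divergence(r, q):
    """Kullback-Leibler divergence D(r || q), as in (2.2)."""
    return sum(p * math.log(p / s) for p, s in zip(r, q) if p > 0)


def cross_entropy(r, q):
    """Cross-entropy E_r[-log q], as in (1.1)."""
    return -sum(p * math.log(s) for p, s in zip(r, q) if p > 0)


def marginal_nll_from_joint(joint_x):
    """Right-hand side of (2.6) for one observation x_mu.

    joint_x[k] holds p(x_mu, z_k) over a finite latent support; the
    conditional p(z | x_mu) is assumed to be the exact posterior.
    """
    q_x = sum(joint_x)
    posterior = [j / q_x for j in joint_x]
    energy = sum(p * -math.log(j) for p, j in zip(posterior, joint_x) if p > 0)
    return energy - entropy(posterior)
```

For any two densities on the same support, `cross_entropy(r, q)` equals `entropy(r) + kl_divergence(r, q)`, and `marginal_nll_from_joint(joint_x)` equals `-math.log(sum(joint_x))`, up to floating-point error.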


2.2 Conditional independence. 


The hidden/latent observables $z = \{z_j\}_{j=1}^{N_{lat}}$ are conditionally independent if, for a given observation $x_\mu$, one has $p(z|x_\mu) = \prod_{j=1}^{N_{lat}} p(z_j|x_\mu)$. From the independence bound of the Boltzmann entropy, $S(p(z|x_\mu)) \le \sum_{j=1}^{N_{lat}} S(p(z_j|x_\mu))$, Cover and Thomas [2006], Chapter 2, conditional independence minimizes the negative entropy term on the right-hand side of (2.6). Everything else being equal, conditional independence is hence optimal for nets.
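
The independence bound can be illustrated on a two-variable discrete joint density. The following is a sketch with hypothetical function names; it only demonstrates the inequality and the equality case for the product of marginals.

```python
import math


def entropy(probs):
    """Boltzmann entropy of a discrete density given as a flat list."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def marginals(joint):
    """Row and column marginals of a joint density given as a nested list."""
    rows = [sum(row) for row in joint]
    cols = [sum(col) for col in zip(*joint)]
    return rows, cols


def joint_entropy(joint):
    return entropy([p for row in joint for p in row])


def product_of_marginals(joint):
    """The independent joint density with the same marginals."""
    rows, cols = marginals(joint)
    return [[a * b for b in cols] for a in rows]
```

For a correlated joint such as `[[0.4, 0.1], [0.1, 0.4]]`, `joint_entropy` is strictly below the sum of the marginal entropies; for `product_of_marginals` of the same table, the two coincide.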


2.3 Exponential/Gibbs class of densities. 

There is a broad class of probability density families - Gibbs a.k.a. canonical or exponential families - which dominate the choices of model densities, both in physics and neural nets. This class includes a sufficiently large number of density families: Gaussian, Bernoulli, exponential, gamma, etc. Their general closed form is:

$p_\lambda(z) = \frac{p(z)}{Z}\, e^{-\sum_s \lambda_s \mathcal{M}_s(z)},$ (2.7)

where $p(z)$ is an arbitrary base or prior density, $\lambda = \{\lambda_s\}$ are Lagrange multipliers a.k.a. natural parameters, $\mathcal{M}_s(z)$ are so-called sufficient statistics, and $Z = Z(\lambda)$ is the normalizing partition function. Knowing $Z$ is equivalent to knowing the free energy $\mathcal{F}(\lambda) = -\log Z(\lambda)$, which allows us to rewrite (2.7) as:

$p_\lambda(z) = p(z)\, e^{\mathcal{F}(\lambda) - \lambda.\mathcal{M}(z)},$ (2.8)

where $\lambda.\mathcal{M}$ is the scalar product of the vectors $\lambda = \{\lambda_s\}$ and $\mathcal{M} = \{\mathcal{M}_s\}$.


2.4 Macroscopic quantities. 


In physics, the expectations of the sufficient statistics $m = E_{p_\lambda(z)}[\mathcal{M}(z)]$ form a complete set of macroscopic a.k.a. thermodynamic quantities or state variables like energy, momenta, number of particles, etc., fully describing the $\mu$-th conditional state, sub-section 1.2, see Landau and Lifshitz [1980], sections 28, 34, 35, 110. In neural nets, the sufficient statistics are typically monomials like $\mathcal{M}_1(z) = z$, $\mathcal{M}_2(z) = z^2$, etc., whose expectations form a vector of moments. As proposed in sub-section 1.6, one can add to the list the symmetry statistics, see section 4 for details. From the definition of free energy, one derives immediately that it is a generating function of the expectations $m$:

$\frac{\partial \mathcal{F}(\lambda)}{\partial \lambda} = m(\lambda),$ (2.9)

and, more generally, of their higher $n$-th moments:

$\frac{\partial^n \mathcal{F}(\lambda)}{\partial \lambda_i \partial \lambda_j \ldots \partial \lambda_k} = (-1)^{n+1} E_{p_\lambda(z)}[\mathcal{M}_i(z) \mathcal{M}_j(z) \ldots \mathcal{M}_k(z)].$ (2.10)
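
The generating-function property (2.9) can be verified numerically for a Gibbs density on a finite support. The sketch below assumes a single sufficient statistic and a discrete latent $z$, with the sign convention of (2.7)-(2.8); all names are illustrative.

```python
import math


def free_energy(lam, base, stats):
    """F(lambda) = -log Z(lambda), with Z = sum_z p(z) exp(-lambda * M(z))."""
    return -math.log(sum(p * math.exp(-lam * m) for p, m in zip(base, stats)))


def gibbs_density(lam, base, stats):
    """p_lambda(z) = p(z) exp(F(lambda) - lambda * M(z)), as in (2.8)."""
    f = free_energy(lam, base, stats)
    return [p * math.exp(f - lam * m) for p, m in zip(base, stats)]


def expectation(density, stats):
    """Macroscopic quantity m = E[M(z)] under the given density."""
    return sum(d * m for d, m in zip(density, stats))


def free_energy_derivative(lam, base, stats, h=1e-5):
    """Central finite-difference approximation of dF/dlambda."""
    return (free_energy(lam + h, base, stats)
            - free_energy(lam - h, base, stats)) / (2 * h)
```

With, say, `base = [0.1, 0.2, 0.3, 0.4]` and `stats = [0, 1, 2, 3]`, the finite-difference derivative agrees with `expectation(gibbs_density(lam, base, stats), stats)` to within the discretization error.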


2.5 Variational Pythagorean theorem. 

The exponential/Gibbs class of families is special because it is the variational maximum entropy class: when the base density $p()$ is trivial, it is the unique functional form which maximizes the Boltzmann entropy $S(f)$ across the universe $\{f(z)\}$ of all densities with given macroscopic quantities $m = E_f[\mathcal{M}(z)]$, see Cover and Thomas [2006], chapter 12. The natural parameters $\lambda = \lambda(m)$ are computed so as to satisfy these constraints. They are Lagrange constraint multipliers in the variational calculus derivation of the maximum entropy property.

The Gibbs class is special in an even stronger sense: it is a minimum divergence class. For an arbitrary base density $p(z)$, the Kullback-Leibler divergence $\mathcal{D}(p_\lambda(z)||p(z))$ minimizes the divergence $\mathcal{D}(f(z)||p(z))$ across the universe $\{f(z)\}$ of all densities with given macroscopic quantities $m = E_f[\mathcal{M}(z)]$. This follows from the variational Pythagorean theorem, Figure 6, Chentsov [1968], Kulhavý [1996], section 3.3:

$\mathcal{D}(f(z)||p(z)) = \mathcal{D}(f(z)||p_\lambda(z)) + \mathcal{D}(p_\lambda(z)||p(z)) \ge \mathcal{D}(p_\lambda(z)||p(z)).$ (2.11)

Figure 6: A naive visualization of the probabilistic (variational) Pythagorean theorem from (2.11).
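
The Pythagorean identity (2.11) can be illustrated on a finite support: take any density $f$, find the Gibbs density $p_\lambda$ with the same macroscopic quantity, and compare the divergences. The sketch below assumes a single sufficient statistic and uses a simple bisection for the moment matching; the helper names and the bracketing interval are assumptions for illustration only.

```python
import math


def free_energy(lam, base, stats):
    return -math.log(sum(p * math.exp(-lam * m) for p, m in zip(base, stats)))


def gibbs_density(lam, base, stats):
    f = free_energy(lam, base, stats)
    return [p * math.exp(f - lam * m) for p, m in zip(base, stats)]


def expectation(density, stats):
    return sum(d * m for d, m in zip(density, stats))


def kl_divergence(f, p):
    return sum(a * math.log(a / b) for a, b in zip(f, p) if a > 0)


def moment_match(f, base, stats, lo=-50.0, hi=50.0, iters=200):
    """Find lambda such that E_{p_lambda}[M] = E_f[M] by bisection.

    Uses the fact that m(lambda) is decreasing in lambda under the
    sign convention of (2.7).
    """
    target = expectation(f, stats)
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if expectation(gibbs_density(mid, base, stats), stats) > target:
            lo = mid  # expectation too large: increase lambda
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

Once `lam = moment_match(f, base, stats)` is found, `kl_divergence(f, base)` should equal `kl_divergence(f, p_lam) + kl_divergence(p_lam, base)` with `p_lam = gibbs_density(lam, base, stats)`, so `p_lam` attains the minimum divergence from the base density among all densities with that macroscopic quantity.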


2.6 Generative error.

As we will see in sub-section 3.5, minimizing the divergence $\mathcal{D}(f(z)||p(z))$ across an unknown a priori family of conditional distributions $p(z|x_\mu) = f(z)$ is crucial for the quality of a generative net. The smaller this divergence, the more likely the newly created objects, sampled from $p(z)$, are to resemble the training set. The minimum divergence property (2.11) implies that we are always better off choosing $p(z|x_\mu)$ from a Gibbs class, as in (2.13), hence the name Gibbs machines. We will refer to the minimum divergence $\mathcal{D}(p(z|x_\mu)||p(z))$ as generative error:

$\mathcal{D}^{gen}(m(x_\mu)) := \mathcal{D}(p(z|x_\mu)||p(z)) \ge 0.$ (2.12)

The lack of explicit dependence on the natural parameters $\lambda$ on the left is because $\mathcal{D}^{gen}$ depends on them only indirectly, via the macroscopic quantities $m = m(\lambda)$ (see sub-section 2.9).

2.7 Conditional latent densities.

For a given observation $x_\mu$, choosing a conditional latent density from the Gibbs type (2.8) is equivalent to:

$p(z|x_\mu) = p(z)\, e^{\mathcal{F}^{gen}(\lambda) - \lambda.\mathcal{M}(z)}, \quad \lambda = \lambda(x_\mu),$ (2.13)

where the superscript ‘gen’ is short for generative. We will see shortly that the free energy $\mathcal{F}^{gen}$ and the negative generative error $-\mathcal{D}^{gen}$ from the previous sub-section are concave conjugates, hence the shared superscript. In practice, in order for $p(z|x_\mu)$ to be tractable, $p(z)$ has to be from a specific parametric family within the Gibbs class; then both $p(z)$ and $p(z|x_\mu)$ will be tractable and in the same family, e.g. Gaussian, exponential, etc.

Except for the symmetry statistics introduced in sections 1.6, 4, the macroscopic quantities $m(x_\mu) = E_{p(z|x_\mu)}[\mathcal{M}(z)]$ for the $\mu$-th quantum state are free parameters. Together with the symmetry statistics, they can be thought of as quantum numbers distinguishing the observations, a.k.a. partial equilibrium states, from one another, in the spirit of quantum statistics, Landau and Lifshitz [1980], section 5, see sub-section 1.2 here. These quantum numbers are added to the rest of the free net parameters, to be optimized by standard methods, like back-propagation/stochastic gradient descent, etc.


2.8 Boltzmann-Gibbs thermodynamic identity.

The identities relating various macroscopic (thermodynamic) quantities are referred to in statistical physics as thermodynamic identities. Recall that the classic Boltzmann-Gibbs distribution is a special case of the exponential/Gibbs distribution (2.8), with a trivial base density $p()$, and only one sufficient statistic - the microscopic energy or Hamiltonian $\mathcal{H}(x_\mu, z)$:

$p^{BG}(z|x_\mu) = e^{\beta \mathcal{F}^{BG}(\beta) - \beta \mathcal{H}(x_\mu, z)},$ (2.14)

where $\beta := 1/T$ is the inverse temperature and the only natural parameter. Taking logs and expectations w.r.t. $p^{BG}(z|x_\mu)$, one gets the classic thermodynamic identity for the free energy $\mathcal{F}^{BG}$ and entropy $S(p^{BG}(z|x_\mu))$:

$\beta \mathcal{F}^{BG}(\beta) = \beta U^{BG}(\beta) - S(U^{BG}(\beta)),$ (2.15)

where the only macroscopic quantity - the energy $U^{BG}$ - is the expectation $U^{BG} := E_{p^{BG}(z|x_\mu)}[\mathcal{H}(x_\mu, z)]$. The entropy $S$ depends on $\beta$ only indirectly, via the energy, i.e. $S = S(U^{BG}(\beta))$.

The increase of free energy has the physics interpretation of work needed to be done by the outside environment, in order to increase the macroscopic quantity of the system, i.e. the energy in this case. The negative sign in (2.15) confirms the intuition that, for the same increase of internal energy $U^{BG}$, less ordered systems need less work from the outside.


2.9 Generative thermodynamic identity. 


As mentioned in sub-section 2.1, in a more general context like (2.13), the negative generative error $-\mathcal{D}^{gen}(m)$ plays the role of generalized entropy $S_p(p(z|x_\mu))$, with base measure $p(z)$. As is standard in statistical mechanics, it can be shown (Kulhavý [1996], Chapter 3) that $-\mathcal{D}^{gen}(m)$ is a concave function, expressible via the Legendre transform of its conjugate - the generative free energy $\mathcal{F}^{gen}(\lambda)$:

$-\mathcal{D}^{gen}(m) = \min_\lambda \{\lambda.m - \mathcal{F}^{gen}(\lambda)\},$ (2.16)

or, equivalently,

$\mathcal{D}^{gen}(m) = \max_\lambda \{-\lambda.m + \mathcal{F}^{gen}(\lambda)\},$ (2.17)

where $\lambda.m$ is a scalar product of the vectors $\lambda$ and $m$, and $\lambda$ is allowed to run free. Note that, while the variational Pythagorean theorem establishes the minimum property of $\mathcal{D}^{gen}$ as a functional $\mathcal{D}^{gen}(f||p)$ over the space of functions $\{f(z)\}$, $\mathcal{D}^{gen}$ is a maximum when viewed as an explicit function $\mathcal{D}^{gen}(m, \lambda)$ of both the macroscopic quantities $m$ and the natural parameters $\lambda$. Equivalently,

$\mathcal{F}^{gen}(\lambda) = \min_m \{\lambda.m + \mathcal{D}^{gen}(m)\},$ (2.18)

where $m$ is allowed to run freely. At the optimal point (“equilibrium”) $m^{gen}$, this implies the dual identity of (2.9):

$\frac{\partial \mathcal{D}^{gen}(m)}{\partial m}\Big|_{m = m^{gen}} =: -\lambda^{gen}.$ (2.19)

Similarly to (2.9), the derivative on the left-hand side defines the function $\lambda^{gen} = \lambda^{gen}(m)$ as a function of $m$ everywhere. While generative free energy was defined in (2.18) for every $\lambda$, unless the function in (2.19) is invertible, its image is only a subset of the full space of natural parameters $\lambda$. Assuming such invertibility, the generalization of (2.15) is:

$\mathcal{F}^{gen}(\lambda) = \lambda.m^{gen}(\lambda) + \mathcal{D}^{gen}(m^{gen}(\lambda)),$ (2.20)

where we skipped for brevity the dependence on $x_\mu$. Counter to classic thermodynamics (2.15), we have a plus sign. The two factors comprising the divergence $\mathcal{D}^{gen}(||)$ clearly work against each other. On the one hand, the negative entropy term $-S(p(z|x_\mu)) < 0$ reduces free energy as usual; less ordered systems take less work to create. On the other hand, when the system resides in an infinitely large “thermostat” of non-trivial density $p(z)$, the cross-entropy term $E_{p(z|x_\mu)}[-\log p(z)] > 0$ increases the free energy back up. It measures the amount of work it takes to counter the thermostat’s influence.

The chain rule and (2.9), (2.19) give in addition the indirect dependence of $\mathcal{D}^{gen}$ on the natural parameters $\lambda$:

$\frac{\partial \mathcal{D}^{gen}(m(\lambda))}{\partial \lambda} = -\lambda.\frac{\partial m(\lambda)}{\partial \lambda} = -\lambda.\frac{\partial^2 \mathcal{F}^{gen}(\lambda)}{\partial \lambda \partial \lambda}.$ (2.21)

From (2.10), this derivative is conveniently expressed in terms of the matrix of second moments of the sufficient statistics:

$\frac{\partial \mathcal{D}^{gen}(m(\lambda))}{\partial \lambda} = \lambda.E_{p(z|x_\mu)}[\mathcal{M}(z)\mathcal{M}(z)].$ (2.22)
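
The Legendre relations (2.17) and (2.20) can be checked numerically in the simplest setting of one sufficient statistic on a finite support. The code below is a sketch under these assumptions, using the sign convention of (2.8); the generative error is computed directly as a Kullback-Leibler divergence rather than from the free energy, so that the identities are actually tested.

```python
import math


def free_energy(lam, base, stats):
    """F(lambda) = -log sum_z p(z) exp(-lambda * M(z))."""
    return -math.log(sum(p * math.exp(-lam * m) for p, m in zip(base, stats)))


def gibbs_density(lam, base, stats):
    f = free_energy(lam, base, stats)
    return [p * math.exp(f - lam * m) for p, m in zip(base, stats)]


def expectation(density, stats):
    return sum(d * m for d, m in zip(density, stats))


def generative_error(lam, base, stats):
    """D(p_lambda || p), computed directly from the two densities."""
    dens = gibbs_density(lam, base, stats)
    return sum(d * math.log(d / p) for d, p in zip(dens, base) if d > 0)


def legendre_objective(lam, m, base, stats):
    """The expression -lambda * m + F(lambda) maximized in (2.17)."""
    return -lam * m + free_energy(lam, base, stats)
```

At a given `lam0` with `m0 = expectation(gibbs_density(lam0, base, stats), stats)`, the value `legendre_objective(lam0, m0, base, stats)` coincides with `generative_error(lam0, base, stats)`, which is (2.20) rearranged, and no other `lam` gives a larger objective for the same `m0`, which is (2.17).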


2.10 q-Gibbs densities. 

Gibbs densities are only a special case (for $q = 1$) of the broad class of q-Gibbs (or q-exponential) densities. The corresponding nonextensive statistical mechanics, Tsallis [2009], describes more adequately long-range-interacting many-body systems like typical human-generated data sets. By virtue of replacing the exponential with a q-exponential, log with q-log and defining a respective q-entropy, most of the formalism of classic thermodynamics has been generalized. Many of the properties of the exponential class remain true for the q-exponential class, Naudts [2010], Amari and Ohara [2011], see open problem 4 in section ??.


3 Application to generative neural nets.

A fully-generative net creates original observations by sampling from an unconditional model density $p(z)$, unencumbered by any observations $\{x_\mu\}$. Perturbative nets, on the other hand, do not contain an unconditional model density $p(z)$. They rely instead on an initial observation $x_\mu$, a conditional density $p(z|x_\mu)$, and the decomposition (2.6) for parameter estimation.

3.1 Perturbative nets.

The most successful family of perturbative nets to date have been the Boltzmann machines, Smolensky [1986], and their multiple re-incarnations. Assuming completeness of the state variables, Boltzmann machines adopt as joint model density the Boltzmann density, a special case of the Boltzmann-Gibbs equilibrium density from sub-section 2.3: $p^{B}(x_\mu, z) = e^{-\mathcal{H}(x_\mu, z)}/Z$, with a single sufficient statistic - a bi-linear Hamiltonian function $\mathcal{H}(,)$ - a trivial base density and temperature $T = 1$. It is not tractable because the partition function $Z$ can not be computed in closed form. On the other hand, for discrete data, the conditional density of the restricted Boltzmann machines (RBM) is tractable and is the familiar Fermi density $p^F(z|x_\mu)$, of the form $1/(1 + e^{(\cdot)})$. The intractable joint density term in (2.6) is handled by approximations of its gradients like contrastive divergence, Hinton [2002], to avoid brute-force Monte Carlo methods averaging impossibly many paths.

Due to the simple bi-linear shape of the Hamiltonian, RBMs have latent variables conditionally independent on the visible variables (and vice versa). Despite their limitations when handling non-binary data, deep Boltzmann machines, Salakhutdinov and Hinton [2009], have until recently been the dominant universal nets: they perform well both as classifiers, Srivastava et al. [2014], and probability density estimators, Salakhutdinov and Murray [2008].

3.2 Reconstruction error.

A neural net is said to have reconstruction capabilities if it has a decoder which, for a given latent $z$, can assign a reconstruction density $p^{rec}(x_\mu|z)$ of an observation $x_\mu$. Following standard procedure, the reconstruction error is the usual cross-entropy (1.1) of this reconstruction density and the empirical density (2.4):

$-\log \mathcal{L}^{rec}(x_\mu) := E_{p(z|x_\mu)}[-\log p^{rec}(x_\mu|z)] \ge 0.$ (3.1)

Reconstruction densities are typically from the exponential/Gibbs class - Bernoulli, Gaussian, etc. - with unit covariance matrix, and a trivial base density $p()$. But the key reconstruction macroscopic quantity - the expectation $m^{rec}(x_\mu) = E_{p(z|x_\mu)}[x_\mu]$ - is generally an intractable function of $z$, given by the net decoder.

The cross-entropy $-\log \mathcal{L}^{rec}$ depends on the generative natural parameters only via the expectation density $p(z|x_\mu)$, and its derivatives are from (3.1):

$\frac{\partial \log \mathcal{L}^{rec}(m(\lambda))}{\partial \lambda} = E_{p(z|x_\mu)}[\mathcal{M}(z) \log p^{rec}(x_\mu|z)].$ (3.2)
(3.2) 
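As an illustration of how a reconstruction error of the form (3.1) can be estimated by Monte Carlo averaging, here is a minimal sketch. It assumes a Gaussian conditional density for the latents and a hypothetical one-layer Bernoulli decoder; the names `reconstruction_error`, `W` and `b` are illustrative assumptions and are not taken from the paper's code.

```python
import numpy as np

def reconstruction_error(x, mu, sigma, W, b, n_samples=1000, seed=0):
    """Monte Carlo estimate of E_{p(z|x)}[-log p_rec(x|z)] for a Bernoulli decoder."""
    rng = np.random.default_rng(seed)
    # Assumption: the conditional density p(z|x) is N(mu, diag(sigma^2)).
    z = mu + sigma * rng.standard_normal((n_samples, mu.size))
    # Hypothetical one-layer decoder producing Bernoulli logits.
    logits = z @ W + b
    # Stable Bernoulli negative log-likelihood:
    # -x*log(s) - (1-x)*log(1-s) = log(1 + exp(logits)) - x*logits, s = sigmoid(logits)
    nll = np.logaddexp(0.0, logits) - x * logits
    # Sum over observables, average over latent samples.
    return nll.sum(axis=1).mean()

rng = np.random.default_rng(1)
x = (rng.random(8) > 0.5).astype(float)   # a toy binarized observation
W = rng.standard_normal((3, 8))           # toy decoder weights
b = np.zeros(8)
err = reconstruction_error(x, mu=np.zeros(3), sigma=np.ones(3), W=W, b=b)
print(err)
```

Writing the latent sample as `mu + sigma * noise` is the change-of-variable form that lets gradients flow to `mu` and `sigma` when the same computation is done in an automatic differentiation framework.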


3.3 Fully-generative nets. Regimes. 

We will distinguish two separate regimes of operation of 
a fully-generative net: 

1. non-creative regime: This is the common regime for net training, validation or testing. Latent observables are sampled from a closed-form model conditional density $p(\mathbf{z}|\mathbf{x})$, as in sub-section 2.7, with observations $\{\mathbf{x}_\mu\}$ attached to the net. A closed-form reconstruction model density $p^{rec}(\mathbf{x}|\mathbf{z})$, as in sub-section 3.2, is also chosen.

2. creative regime: The net has been trained and latent observables $\mathbf{z} = \{z_j\}_{j=1}^{N_{lat}}$ are sampled from a closed-form model density $p(\mathbf{z})$, unencumbered by observations $\{\mathbf{x}_\mu\}$. The same reconstruction density $p^{rec}(\mathbf{x}|\mathbf{z})$ as in the non-creative regime is used.

The joint density is chosen to be:

$$p(\mathbf{x}, \mathbf{z}) := p(\mathbf{z})\, p^{rec}(\mathbf{x}|\mathbf{z}). \qquad (3.3)$$

The empirical densities are the same as in sub-section 2.1.


3.4 Variational error.

The implied marginal density $q(\mathbf{x}) := \int p(\mathbf{x}, \mathbf{z})\,d\mathbf{z}$ and the implied conditional density $q(\mathbf{z}|\mathbf{x}) = p(\mathbf{x}, \mathbf{z})/q(\mathbf{x})$ are generally intractable in fully-generative nets. Moreover, the implied conditional $q(\mathbf{z}|\mathbf{x})$ is of course different from our chosen a priori $p(\mathbf{z}|\mathbf{x})$ in the non-creative regime. For a given observation $\mathbf{x}_\mu$, the divergence between the two is called variational error:

$$\mathcal{D}^{var}(\mathbf{x}_\mu) = \mathcal{D}(p(\mathbf{z}|\mathbf{x}_\mu)\,||\,q(\mathbf{z}|\mathbf{x}_\mu)) \geq 0. \qquad (3.4)$$

Its optimization is the subject of the so-called fixed-form variational Bayes analysis Saul and Jordan [1995]. We will discuss its estimation in sub-section 3.8, see also open problem 7, section 6.


3.5 Cross-entropy decomposition. 

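Following standard procedure, our training minimization target is the cross-entropy between the marginal density $q(\mathbf{x}) = \int p(\mathbf{x}, \mathbf{z})\,d\mathbf{z}$ and the non-creative empirical density (2.4). Expanding the joint density (3.3) in both Bayesian directions, one can decompose in terms of the implied conditional $q(\mathbf{z}|\mathbf{x}_\mu)$ as follows:

$$-\log \mathcal{L}(r||q) := \mathbb{E}_{r(\mathbf{z})}[-\log p(\mathbf{z})] + \mathbb{E}_{r(\mathbf{x},\mathbf{z})}[-\log p^{rec}(\mathbf{x}|\mathbf{z})] + \mathbb{E}_{r(\mathbf{x},\mathbf{z})}[\log q(\mathbf{z}|\mathbf{x})]. \qquad (3.5)$$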


From the explicit form (2.4) of $r(\mathbf{x}, \mathbf{z})$, for the $\mu$-th observation $\mathbf{x}_\mu$:

$$-\log q(\mathbf{x}_\mu) = \mathbb{E}_{p(\mathbf{z}|\mathbf{x}_\mu)}[-\log p(\mathbf{z})] + \mathbb{E}_{p(\mathbf{z}|\mathbf{x}_\mu)}[-\log p^{rec}(\mathbf{x}_\mu|\mathbf{z})] + \mathbb{E}_{p(\mathbf{z}|\mathbf{x}_\mu)}[\log q(\mathbf{z}|\mathbf{x}_\mu)]. \qquad (3.6)$$

Subtracting the entropy $S(p(\mathbf{z}|\mathbf{x}_\mu))$ from the first term and adding it to the third, and using the definitions (2.12) of generative error $\mathcal{D}^{gen}$, (3.1) of reconstruction error $-\log \mathcal{L}^{rec}$ and (3.4) of variational error $\mathcal{D}^{var}$, the final expression for our minimization target is:

$$-\log q(\mathbf{x}_\mu) = \mathcal{D}^{gen}(\mathbf{x}_\mu) - \log \mathcal{L}^{rec}(\mathbf{x}_\mu) - \mathcal{D}^{var}(\mathbf{x}_\mu). \qquad (3.7)$$

Let us highlight the essence and computability of each of the three components:

1. $\mathcal{D}^{gen} \geq 0$: the generative error (2.12) is the divergence between the generative densities in the non-creative and creative regimes. Minimizing it ensures the general similarity of objects generated in the two regimes. It can be interpreted as the hypotenuse in the variational Pythagorean theorem (2.11) and is computable in closed form for many Gibbs/exponential densities.

2. $-\log \mathcal{L}^{rec} \geq 0$: the reconstruction error (3.1) measures the negative likelihood of getting $\mathbf{x}_\mu$ back, after the transformations and randomness inside the net. It can be computed by any net endowed with a decoder via standard Monte Carlo averaging, as traditional auto-encoders do. Importantly for training, in order to compute gradients with respect to the generative macroscopic quantities $\mathbf{m}^{gen}$, a change of variable is needed, replacing sampling from $p(\mathbf{z}|\mathbf{x}_\mu)$ by sampling from $p(\mathbf{z})$. Such transformations exist for many of the exponential/Gibbs probability families Kingma and Welling [2014].

3. $\mathcal{D}^{var} \geq 0$: the variational error measures the divergence between our chosen functional form $p(\mathbf{z}|\mathbf{x}_\mu)$ for the latent density and the latent density $q(\mathbf{z}|\mathbf{x}_\mu)$ implied by the net. Due to the intractability of $q(\mathbf{z}|\mathbf{x}_\mu)$, the variational error has to be computed numerically via Monte Carlo methods, see sub-section 3.8 and open problem 7, section 6.


3.6 Cross-entropy upper bound.

Dropping the variational error in (3.7) yields an upper bound $\mathcal{B}()$ for the cross-entropy* $-\log q(\mathbf{x}_\mu)$:

$$\mathcal{B}(\mathbf{x}_\mu) := \mathcal{D}^{gen}(\mathbf{x}_\mu) - \log \mathcal{L}^{rec}(\mathbf{x}_\mu) = \mathcal{D}(p(\mathbf{z}|\mathbf{x}_\mu)\,||\,p(\mathbf{x}_\mu, \mathbf{z})). \qquad (3.8)$$

* This is an expanded version of the textbook variational inequality Cover and Thomas [2006], Exercise 8.6.

While the last expression is formally equivalent to the general expression (2.6) for $-\log q(\mathbf{x}_\mu)$, the density $p(\mathbf{z}|\mathbf{x}_\mu)$ in (2.6) is the correct conditional density of the joint density $p(\mathbf{x}_\mu, \mathbf{z})$, while $p(\mathbf{z}|\mathbf{x}_\mu)$ here is merely an approximation to the implied conditional $q(\mathbf{z}|\mathbf{x}_\mu)$.

From (2.22) and (3.2), the derivative of the upper bound with respect to the generative natural parameters is:

$$\frac{\partial \mathcal{B}(\mathbf{x}_\mu)}{\partial \boldsymbol{\lambda}} = \boldsymbol{\lambda} \cdot \mathbb{E}_{p(\mathbf{z}|\mathbf{x}_\mu)}[\mathcal{M}(\mathbf{z})\mathcal{M}^T(\mathbf{z})] + \mathbb{E}_{p(\mathbf{z}|\mathbf{x}_\mu)}[\log p^{rec}(\mathbf{x}_\mu|\mathbf{z})\,\mathcal{M}(\mathbf{z})]. \qquad (3.9)$$

At optima or inflection points of $\mathcal{B}(\boldsymbol{\lambda})$, the derivative is zero and, if the moment matrix $\mathbb{E}_{p(\mathbf{z}|\mathbf{x}_\mu)}[\mathcal{M}(\mathbf{z})\mathcal{M}^T(\mathbf{z})]$ is invertible, the generative natural parameters become:

$$\boldsymbol{\lambda} = -\mathbb{E}_{p(\mathbf{z}|\mathbf{x}_\mu)}[\log p^{rec}(\mathbf{x}_\mu|\mathbf{z})\,\mathcal{M}(\mathbf{z})] \cdot \mathbb{E}_{p(\mathbf{z}|\mathbf{x}_\mu)}[\mathcal{M}(\mathbf{z})\mathcal{M}^T(\mathbf{z})]^{-1}. \qquad (3.10)$$
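The decomposition of the cross-entropy into generative error plus reconstruction error minus variational error, and the resulting upper bound, can be checked in closed form on a toy model. The sketch below assumes a one-dimensional linear-Gaussian net, with prior $N(0,1)$, reconstruction density $N(x; z, 1)$ and a chosen conditional $N(m, s^2)$; this toy model and all function names are illustrative assumptions, not part of the paper.

```python
import math

def gaussian_kl(m1, v1, m2, v2):
    """KL divergence D(N(m1, v1) || N(m2, v2)) for scalar Gaussians (v = variance)."""
    return 0.5 * (v1 / v2 + (m1 - m2) ** 2 / v2 - 1.0 - math.log(v1 / v2))

def decomposition(x, m, s2):
    """Closed-form terms of the cross-entropy for the toy linear-Gaussian model."""
    # Generative error: divergence of the chosen conditional from the prior N(0, 1).
    d_gen = gaussian_kl(m, s2, 0.0, 1.0)
    # Reconstruction error: E_{N(m, s2)}[-log N(x; z, 1)].
    rec = 0.5 * math.log(2 * math.pi) + 0.5 * ((x - m) ** 2 + s2)
    # Variational error: divergence from the implied conditional N(x/2, 1/2).
    d_var = gaussian_kl(m, s2, x / 2.0, 0.5)
    # Exact cross-entropy: the implied marginal is N(0, 2).
    nll = 0.5 * math.log(2 * math.pi * 2.0) + x ** 2 / 4.0
    return d_gen, rec, d_var, nll

d_gen, rec, d_var, nll = decomposition(x=1.0, m=0.4, s2=0.64)
bound = d_gen + rec              # upper bound on the cross-entropy
print(bound, nll, bound - d_var)
```

In this toy model the bound is tight exactly when the chosen conditional equals the implied one, i.e. `m = x / 2` and `s2 = 0.5`.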


3.7 Gibbs machines. 


It is clear from the above derivation that (3.7), (3.8) are universal for fully-generative nets and were hence used in the first fully-generative nets, the VAE-s Kingma and Welling [2014], Rezende et al. [2014]. The VAE-s owe their name to the variational error term and were introduced in the context of very general sampling densities. Everything else being equal, the variational Pythagorean theorem (2.11) implies that latent sampling densities (2.13) from the Gibbs class minimize the generative error. Hence we call the respective nets Gibbs machines. While the variational error is due to an approximation, the variational principle from which the Gibbs class is derived is fundamental to statistical mechanics.


3.8 Estimating variational error. 


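A closer look at the equilibrium identity for the natural parameters (3.10) reveals that it is identical to the maximum likelihood estimates of the coefficients of a linear regression:

$$\log p^{rec}(\mathbf{x}_\mu|\mathbf{z}) = a - \boldsymbol{\lambda} \cdot \mathcal{M}(\mathbf{z}) + e(\mathbf{x}_\mu, \mathbf{z}), \qquad (3.11)$$

where $a$ is an intercept and $e$ an error term. This observation was first made in Richard and Zhang [2007] and later used in Salimans and Knowles [2013] to estimate variational error in the context of Variational Bayes. Adding $\log p(\mathbf{z})$ to both sides, adding/subtracting the normalization term of (2.13) to the right side, and recalling (2.13), (3.3), transforms (3.11) into:

$$\log p(\mathbf{x}_\mu, \mathbf{z}) = a - \ldots + \log p(\mathbf{z}|\mathbf{x}_\mu) + e(\mathbf{x}_\mu, \mathbf{z}). \qquad (3.12)$$

Requiring $\mathbb{E}_{p(\mathbf{z}|\mathbf{x}_\mu)}[e(\mathbf{x}_\mu, \mathbf{z})] = 0$ yields for the regression intercept the last expression in (3.8), with a negative sign:

$$\log p(\mathbf{x}_\mu, \mathbf{z}) = -\mathcal{B}(\mathbf{x}_\mu) + \log p(\mathbf{z}|\mathbf{x}_\mu) + e(\mathbf{x}_\mu, \mathbf{z}). \qquad (3.13)$$

The implied marginal density $q(\mathbf{x}_\mu) := \int p(\mathbf{x}_\mu, \mathbf{z})\,d\mathbf{z}$ is now:

$$q(\mathbf{x}_\mu) = e^{-\mathcal{B}(\mathbf{x}_\mu)}\, \mathbb{E}_{p(\mathbf{z}|\mathbf{x}_\mu)}[e^{e(\mathbf{x}_\mu, \mathbf{z})}], \qquad (3.14)$$

hence the cross-entropy for the $\mu$-th observation is:

$$-\log q(\mathbf{x}_\mu) = \mathcal{B}(\mathbf{x}_\mu) - \log \mathbb{E}_{p(\mathbf{z}|\mathbf{x}_\mu)}[e^{e(\mathbf{x}_\mu, \mathbf{z})}], \qquad (3.15)$$

i.e.

$$\mathcal{D}^{var}(\mathbf{x}_\mu) = \log \mathbb{E}_{p(\mathbf{z}|\mathbf{x}_\mu)}[e^{e(\mathbf{x}_\mu, \mathbf{z})}]. \qquad (3.16)$$

This can be estimated either via Monte Carlo methods, or in a closed form. Assuming for example that $e(\mathbf{x}_\mu, \mathbf{z}) \sim \mathcal{N}(0, \sigma(\mathbf{x}_\mu)^2)$ is a Gaussian with variance $\sigma(\mathbf{x}_\mu)^2$, yields

$$\mathcal{D}^{var}(\mathbf{x}_\mu) = \frac{\sigma(\mathbf{x}_\mu)^2}{2}.$$

While not rigorous, it is interesting to use these $\mathcal{D}^{var}(\mathbf{x}_\mu)$ estimates to train the neural net with the full cross-entropy $-\log q(\mathbf{x}_\mu)$ from (3.7), and not just its upper bound $\mathcal{B}(\mathbf{x}_\mu)$ from (3.8) - see open problem 7, section 6.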
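The residual-based estimate of the variational error can be tried numerically. The sketch below is a minimal illustration under stated assumptions: it reuses a toy one-dimensional linear-Gaussian model (prior $N(0,1)$, reconstruction density $N(x; z, 1)$, chosen conditional $N(m, s^2)$), for which the exact variational error is known, and it is not the paper's training code. The function names are hypothetical.

```python
import numpy as np

def log_normal(z, mean, var):
    """Log-density of a scalar Gaussian N(mean, var)."""
    return -0.5 * np.log(2 * np.pi * var) - 0.5 * (z - mean) ** 2 / var

def estimate_variational_error(x, m, s2, n_samples=200_000, seed=0):
    rng = np.random.default_rng(seed)
    # Sample latents from the chosen conditional p(z|x) = N(m, s2).
    z = m + np.sqrt(s2) * rng.standard_normal(n_samples)
    # log p(x, z) for the toy model and log p(z|x) for the chosen conditional.
    log_joint = log_normal(z, 0.0, 1.0) + log_normal(x, z, 1.0)
    log_post = log_normal(z, m, s2)
    diff = log_joint - log_post
    # Upper bound estimate: the intercept that gives the residual zero mean.
    bound = -diff.mean()
    e = diff + bound                        # regression residual e(x, z)
    d_var_mc = np.log(np.mean(np.exp(e)))   # Monte Carlo estimate of log E[exp(e)]
    d_var_gauss = 0.5 * e.var()             # closed form under a Gaussian residual
    return bound, d_var_mc, d_var_gauss

bound, d_var_mc, d_var_gauss = estimate_variational_error(x=1.0, m=0.4, s2=0.64)
print(bound, d_var_mc, d_var_gauss)
```

The Monte Carlo estimate is an importance-sampling average, so its variance grows quickly when the chosen conditional is much narrower than the implied one; the Gaussian shortcut is only as good as the Gaussian assumption on the residual.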


3.9 Generative conditional independence. 


Without offering a rigorous proof, we believe that the conditional independence argument from sub-section 2.2 can be generalized to this context: For a given reconstruction error, the generative error $\mathcal{D}^{gen}$ and hence the upper bound (3.8) is minimized when the latent variables are conditionally independent. For Gaussian multi-variate sampling, this follows from the explicit form of the generative error Gil et al. [2013], table 3, and Hadamard's inequality Cover and Thomas [2006], chapter 8.


4 Latent symmetry statistics. Momenta.

We will show for brevity only how to build spatial invariances in a 2-dim square visual recognition model. For real-life data sets, the symmetry statistics have to be computed via another net, see Georgiev [2015b].

Every observable, i.e. pixel $x_i$, $i = 1, \ldots, N$, can be assigned horizontal $h_i$ and vertical $v_i$ integer coordinates on the screen, e.g., $h_i, v_i \in \{1, \ldots, \sqrt{N}\}$. In these coordinates, a row-observation $\mathbf{x}_\mu = \{x_{\mu i}\}_{i=1}^N$ becomes a matrix-observation $\{x_{\mu h_i v_i}\}$ and a net layer of size $N$ becomes a layer of size $\sqrt{N} \times \sqrt{N}$. The center of mass $(h_\mu, v_\mu)$:

$$h_\mu := \frac{\sum_{i=1}^N x_{\mu i} h_i}{\sum_{i=1}^N x_{\mu i}}, \qquad v_\mu := \frac{\sum_{i=1}^N x_{\mu i} v_i}{\sum_{i=1}^N x_{\mu i}} \qquad (4.1)$$

of every observation defines latent symmetry statistics $\mathbf{h} = \{h_\mu\}_\mu$ and $\mathbf{v} = \{v_\mu\}_\mu$, see sub-sections 1.6, 2.3. Without loss of generality, we assumed here that $x_{\mu i} \geq 0$, hence $0 < h_\mu, v_\mu < \sqrt{N}$. In the coordinate system centered by $(h_\mu, v_\mu)$, we have for every $\mu$ the new coordinates:

$$(h_{\mu i}, v_{\mu i}) := (h_i - h_\mu, v_i - v_\mu), \qquad (4.2)$$

$-\sqrt{N} < h_{\mu i}, v_{\mu i} < \sqrt{N}$. In this coordinate system, every pixel has polar coordinates $r_{\mu i} := \sqrt{h_{\mu i}^2 + v_{\mu i}^2}$, $\varphi_{\mu i} := \mathrm{atan2}(v_{\mu i}, h_{\mu i}) \in (-\pi, \pi]$, hence the new symmetry statistics: scale $\mathbf{r} = \{r_\mu\}_\mu$ and angle $\boldsymbol{\varphi} = \{\varphi_\mu\}_\mu$:

$$r_\mu := \frac{\sum_{i=1}^N x_{\mu i} r_{\mu i}}{\sum_{i=1}^N x_{\mu i}}, \qquad \varphi_\mu := \frac{\sum_{i=1}^N x_{\mu i} \varphi_{\mu i}}{\sum_{i=1}^N x_{\mu i}}, \qquad (4.3)$$

$0 < r_\mu < \sqrt{N}$, $-\pi < \varphi_\mu < \pi$. In order to un-scale and un-rotate an image, change coordinates one more time to:

$$(h'_{\mu i}, v'_{\mu i}) := \frac{C}{r_\mu}\,(h_{\mu i}, v_{\mu i}) \begin{pmatrix} \cos(-\varphi_\mu) & \sin(-\varphi_\mu) \\ \sin(\varphi_\mu) & \cos(-\varphi_\mu) \end{pmatrix}, \qquad (4.4)$$

$-M < h'_{\mu i}, v'_{\mu i} < M$, for some constants $C$, $M$ depending$^6$ on $\min r_\mu$, $\max r_\mu$. After rounding and a shift, we thus have for any observation $\mu$ an index mapping:

$$\{1, \ldots, N\} \rightarrow \{1, \ldots, (2M+1)^2\}, \qquad i \mapsto h'_{\mu i} + (v'_{\mu i} - 1)(2M+1). \qquad (4.5)$$

When $2M+1 > \sqrt{N}$, the $(2M+1-\sqrt{N})^2$ indexes which are not in the mapping image correspond to identically zero observables for that observation. In the coordinates (4.4), a layer of size $N$ becomes a layer of size $(2M+1)^2$.

$^6$ The scale $r_\mu$ typically needs to have a lower bound, in order to ensure that $M$ is of the same order of magnitude as $\sqrt{N}$.

In summary: i) for auto-encoders, apply mapping (4.5) at the input and its inverse at the output of the net; ii) for classifiers, apply mapping (4.5) at the input only; iii) for both, include in addition the symmetry statistics $\mathbf{h}, \mathbf{v}, \mathbf{r}, \boldsymbol{\varphi}$ in the latent layer, if needed, see Georgiev [2015b]. The prior model density $p()$ of symmetry statistics can be assumed equal to their parametrized posterior $p(.|\mathbf{x}_\mu)$, but there are other options.

When sampling the symmetry statistics from independent Laplacians as in (1.3) e.g., the respective density means are set to be $h_\mu, v_\mu, r_\mu, \varphi_\mu$ from (4.1), (4.3). The density scales $\sigma_\mu$, on the other hand, are free parameters, and can in principle be optimized in the non-creative regime, alongside the rest of the net parameters, sub-section 2.7. As argued in sub-section 1.6, the inverted scales are the scaled momenta. In the creative regime, when sampling e.g. from $\mathbf{h}$ alone, one will get horizontally shifted identical replicas. See open problem 2, section 6.


5 Experimental results.

The Theano Bastien et al. [2012] code used for experiments is in Georgiev [2015a], see also Popov [2015].


5.1 Non-generative ACE.

The motivation for the non-generative ACE comes from the Einstein observation entropies $\{-\log p^G(\mathbf{x}_\mu)\}_\mu$, as in sub-section 1.4, and their relation to singular value decomposition (SVD). Recall that the SVD of the $B \times N$ data matrix $\mathbf{X}$ with $B$ observations and $N$ observables is $\mathbf{X} = \mathbf{V}\boldsymbol{\Lambda}\mathbf{W}^T$. The $B \times B$ matrix $\mathbf{V}\mathbf{V}^T$ is i) a projection mapping; ii) its diagonals are up to a constant the negative Einstein observation entropies from sub-section 1.4, i.e., the Gaussian log-likelihoods $\{-\log p^G(\mathbf{x}_\mu)\}_\mu$; and iii) it is invariant on $\mathbf{X}$, i.e. $\mathbf{X} = \mathbf{V}\mathbf{V}^T\mathbf{X}$.
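The three properties of the projection built from the left singular vectors are easy to verify numerically. The sketch below is illustrative only and uses a small random matrix in place of real data, with more observations than observables. For property ii) it checks the assumed reading that each diagonal entry equals the quadratic form of the observation with the inverse of the un-normalized second-moment matrix, which is the data-dependent term of a zero-mean Gaussian log-likelihood.

```python
import numpy as np

rng = np.random.default_rng(0)
B, N = 20, 6                      # B observations, N observables (toy sizes)
X = rng.standard_normal((B, N))   # stand-in for the B x N data matrix

# Thin SVD: X = V diag(lam) W^T, with V of shape (B, N).
V, lam, Wt = np.linalg.svd(X, full_matrices=False)
P = V @ V.T                       # the B x B matrix V V^T

# i) projection: applying it twice changes nothing.
print(np.allclose(P @ P, P))
# iii) invariance on X: V V^T X = X.
print(np.allclose(P @ X, X))
# ii) diagonal entries as quadratic forms x_mu (X^T X)^{-1} x_mu^T.
quad = np.einsum("mi,ij,mj->m", X, np.linalg.inv(X.T @ X), X)
print(np.allclose(np.diag(P), quad))
```

The trace of the projection equals the number of retained singular vectors, so the diagonal entries can be read as how strongly each observation loads on the retained subspace.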

Figure 7: Shallow auto-encoder in the space of observables (left) and observations (right). Minimization targets are the reconstruction errors in the respective spaces, $-\log \mathcal{L}_{recon}$ and $-\log \mathcal{L}^{dual}_{recon}$, defined for binarized data in Appendix A. $\varphi$, $g$ are non-linearities.

Let us consider a shallow auto-encoder, Figure 7 left, and its dual in the space of observations, Figure 7 right. It can be shown that for tied weights in the absence of non-linearities, the optimal $B \times N_{lat}$ hidden layer solution $\mathbf{H}_0$ on the left is $\mathbf{H}_0 = \ldots = \mathbf{V}$ Georgiev [2015c]. The divergence of $\mathbf{V}\mathbf{V}^T\mathbf{X}$ from $\mathbf{X}$ is thus the reconstruction error in the dual space and minimizing it gets us closer to the optimal $\mathbf{H}_0$. If we treat the first hidden layer $\mathbf{H}$ of a classifier as the rescaled dual weight matrix, we arrive at the dual reconstruction error:


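$$-\log \mathcal{L}^{dual}_{recon} = \frac{1}{B} \sum_{i} \mathbb{E}_{\mathbf{x}_i}[-\log \varphi(\mathbf{H}\mathbf{H}^T\mathbf{x}_i/B)], \qquad (5.1)$$

for a given column-vector observable $\mathbf{x}_i = \{x_{\mu i}\}_{\mu=1}^B$ and sigmoid non-linearity $\varphi()$, see Appendix A.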


The non-generative ACE has as minimization target the composite cross-entropy (1.4), with $-\log \mathcal{L}_{AE}$ replaced by the dual reconstruction error (5.1). The orthogonality $\mathbf{V}^T\mathbf{V} = \mathbf{I}_{N_{lat}}$ of $\mathbf{V}$ implies the need for an additional batch normalization, similar to Ioffe and Szegedy [2015], see Appendix A.


The best known results for the test classification error of feed-forward, non-convolutional nets, without artificial data augmentation, are in the 0.9-1% range Srivastava et al. [2014], table 2. As shown on the right of Figure 8, the non-generative ACE offers a 20-30% improvement.



Figure 8: Left. The same Q-Q plot as on the right of Figure 2 but for the non-generative ACE, sub-section 5.1, with the same hyper-parameters as on the right of Figure 1. Right. Classification error for the MNIST 10000 test set, as a function of training epochs, i.e., one full sweep over all training observations. The top line is the standard classifier as on the right of Figure 1. The bottom line is the classification error of the non-generative ACE with the same hyper-parameters.

5.2 Generative ACE. 

The architecture is shown in the ACE architecture figure, the minimization target is the ACE cross-entropy (1.4), with $-\log \mathcal{L}_{AE}$ replaced by the upper bound (3.8). Laplacian sampling density is used in training and the mixed Laplacian in testing, with the explicit formulas for the generative error in Appendix B. The generative ACE produces similarly outstanding classification results as the non-generative ACE on the regular MNIST data set, Figure 9, left. Even without tweaking hyper-parameters, it also produces outstanding results for the density estimation of the binarized MNIST data set, Figure 9, right. An upper bound for the negative log-likelihood in the 86-87 range is in the ballpark of the best non-recurrent nets, Gregor et al. [2015], table 2.



Figure 9: Left. Classification error for the MNIST 10000 test set. Top line is from a standard classifier net as on the right of Figure 1. Bottom lines are from generative ACE in classification mode: Gaussian sampling (red) and Laplacian (green). Layer sizes 784-700-(100x10)-(700x10)-(784x10) for the AE branch and 784-700-700-700-10 for the C branch, see the ACE architecture figure and Appendix A; learning rate = 0.0015, decay = 500 epochs, batch size = 10000. The dual reconstruction error $-\log \mathcal{L}^{dual}_{recon}$ from sub-section 5.1 is added to the overall cost. Right. Upper bound (3.8) of the negative log-likelihood for the binarized MNIST 10000 test set. The top line is the standard Gibbs machine with Gaussian sampling, layer sizes 784-700-400-700-784 and other hyper-parameters as below. The middle line is the same net but with Laplacian sampling. The bottom line is generative ACE with Laplacian mixture sampling. Layer sizes 784-700-(400x10)-(700x10)-(784x10) for the AE branch and 784-700-700-700-10 for the C branch, see the ACE architecture figure and Appendix A; learning rate = 0.0002, decay = 500 epochs, batch size = 1000.

6 Open problems.

1. Use the freely available intricates directly as feature detectors, in lieu of artificially computed Independent Component Analysis (ICA) features Hyvarinen et al. [2009].

2. Test empirically the performance of ACE, with the symmetry statistics added as in section 4, computed either in closed form, or from specialized nets, as in Jaderberg et al. [2015]. For the distorted MNIST and CIFAR10 datasets, see Georgiev [2015b].

3. Deepen and make generative the shallow dual encoder of the non-generative ACE, sub-sections 1.3, 5.1 here.

4. Test empirically q-Gibbs machines, with cross-entropies replaced by their q-equivalents and sampling from q-Gibbs densities, sub-section 2.10 here.

5. Improve the upper bound for the generative error of mixture densities in Appendix B by using variational methods as in Hershey and Olsen [2007].

6. How is the ACE blend of exponential and mixture densities related to the beautiful duality between these two families, underlying information geometry Amari and Nagaoka [2001], section 3.7?

7. Estimate the variational error (3.4) Richard and Zhang [2007], Salimans and Knowles [2013] and use it in training to minimize the full cross-entropy, not merely its upper bound, as suggested in sub-section 3.8 here.

Appendices

A Software and implementation. 

The optimizer used is Adam Kingma and Ba [2015] stochastic gradient descent back-propagation. Specific hyper-parameters are in the text. We used only two standard sets of hyper-parameters, one for the classifier branch and one for the auto-encoder branch, with no further optimization.

For reconstruction error of a binarized $\mu$-th row-observation $\mathbf{x}_\mu = \{x_{\mu i}\}_{i=1}^N$ and its reconstruction $\hat{\mathbf{x}}_\mu$, we use the standard binary cross-entropy Bastien et al. [2012]: $\mathbb{E}_{\mathbf{x}_\mu}[-\log \hat{\mathbf{x}}_\mu] = \sum_{i=1}^N \left(-x_{\mu i} \log \hat{x}_{\mu i} - (1 - x_{\mu i}) \log(1 - \hat{x}_{\mu i})\right)$, using a sigmoid last non-linearity $\varphi$. The batch cross-entropy is $-\log \mathcal{L}_{recon} = \frac{1}{B}\sum_\mu \mathbb{E}_{\mathbf{x}_\mu}[-\log \hat{\mathbf{x}}_\mu]$. In the space of observations, as on the right plot of Figure 7, the dual reconstruction error is the same binary cross-entropy $\mathbb{E}_{\mathbf{x}_i}[-\log \hat{\mathbf{x}}_i]$, but for the $i$-th observable $\mathbf{x}_i = \{x_{\mu i}\}_{\mu=1}^B$ and a sum over $\mu$ instead of $i$. The batch cross-entropy is $-\log \mathcal{L}^{dual}_{recon} = \frac{1}{B}\sum_i \mathbb{E}_{\mathbf{x}_i}[-\log \hat{\mathbf{x}}_i]$, with a normalization factor conforming to the space of observables.

The non-linearities are tanh() in the auto-encoder branch and two-unit maxout Goodfellow et al. [2013] in the classifier branch. Weight matrices of size $P \times N$ are initialized as random Gaussian matrices, normalized by the order of magnitude $\sqrt{P} + \sqrt{N}$ of their largest eigenvalue. As discussed in sub-section 5.1, hidden observables in the first and last hidden layer of classifiers are batch-normalized, i.e. de-meaned and divided by their second moment. Unlike Ioffe and Szegedy [2015], batch normalization is enforced identically in both the train and test set, hence test results depend slightly on the test batch.


B Generative error formulas.

When sampling from a Laplacian, in order to have a unity variance in the prior, we choose for the independent one-dimensional latents a prior $p(z) = p^{Lap}(z; 0, \sqrt{0.5})$, where $p^{Lap}(z; \mu, b) = \exp(-|z - \mu|/b)/(2b)$ is the standard Laplacian density with mean $\mu$ and scale $b$. In order to have zero generative error when $(\mu, \sigma) \rightarrow (0, 1)$, we parametrize the conditional posterior as $p(z|.) = p^{Lap}(z; \mu, \sigma\sqrt{0.5})$. The generative error in (3.8) equals: $-\log \sigma + |\mu|/\sqrt{0.5} + \sigma \exp(-|\mu|/(\sigma\sqrt{0.5})) - 1$, see Gil et al. [2013], table 3.

For the divergence between a mixture prior $\sum_s \alpha_s p_s(z)$ and a mixture posterior $\sum_s \alpha_s p_s(z|.)$ with the same weights $\{\alpha_s\}_s$, we use the upper bound $\sum_s \alpha_s \mathcal{D}(p_s(z|.)\,||\,p_s(z))$ implied by the log sum inequality Cover and Thomas [2006]. For improvements, see open problem 5 in section 6.
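The Laplacian generative error and the mixture upper bound from this appendix can be written in a few lines. This is a minimal sketch under stated assumptions: scalar latents, a prior scale of $\sqrt{0.5}$ as in the text, and hypothetical function names that are not taken from the paper's code.

```python
import math

SCALE0 = math.sqrt(0.5)  # prior scale b; a Laplacian has variance 2*b^2, so this gives unit variance

def laplace_generative_error(mu, sigma):
    """Closed-form D(Lap(mu, sigma*sqrt(0.5)) || Lap(0, sqrt(0.5)))."""
    a = abs(mu)
    return (-math.log(sigma) + a / SCALE0
            + sigma * math.exp(-a / (sigma * SCALE0)) - 1.0)

def mixture_upper_bound(weights, mus, sigmas):
    """Weighted sum of component divergences, an upper bound for equal-weight mixtures."""
    return sum(w * laplace_generative_error(m, s)
               for w, m, s in zip(weights, mus, sigmas))

print(laplace_generative_error(0.0, 1.0))   # posterior equal to the prior
print(laplace_generative_error(0.5, 2.0))
print(mixture_upper_bound([0.3, 0.7], [0.0, 0.5], [1.0, 2.0]))
```

The bound is exact when every posterior component coincides with its prior component and is otherwise loose, since the log sum inequality ignores overlap between different components.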